Obesity infographics:
Hadley Wickham (paraphrased):
Visualization can surprise you, but it doesn’t scale well. Modeling scales well, but it doesn’t (fundamentally) surprise.
lmplot in seaborn - lm stands for linear model
The linear model:
\[ Y = Xb + e \]
where \(Y\) represents the outcome variable, \(X\) is a matrix of predictors, \(b\) represents the “parameters”, and \(e\) represents the errors, or “residuals”.
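As a minimal sketch of what fitting this model means, the snippet below estimates \(b\) by least squares with NumPy. The toy data and variable names are hypothetical and chosen only for illustration; the outcome is built as exactly 2 + 3x, so the errors are zero.

```python
import numpy as np

# Hypothetical toy data: the outcome is exactly 2 + 3 * x, so the errors are zero
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
Y = 2.0 + 3.0 * x

# Design matrix X: a column of ones (the intercept) next to the predictor
X = np.column_stack([np.ones_like(x), x])

# Least-squares estimate of b: the value that minimizes the sum of squared errors
b, *_ = np.linalg.lstsq(X, Y, rcond=None)

fitted = X @ b   # Xb, the model's predictions
e = Y - fitted   # residuals: what the model leaves unexplained

print(b)
```

Here `b` recovers the intercept 2 and the slope 3 up to floating-point rounding, and every residual is effectively zero. With real data the residuals are not zero, which is why they are worth plotting.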
statsmodels package, which installs with Anaconda
Let’s get an example ready:
import pandas as pd
import seaborn as sns
sns.set_context('notebook')
import statsmodels.formula.api as smf
df = pd.read_csv('http://personal.tcu.edu/kylewalker/data/texas_colleges.csv')
df['grad_rate'] = df.grad_rate * 100
f = smf.ols(formula = 'median_earn ~ grad_rate', data = df).fit()
f.summary()
f2 = smf.ols(formula = 'median_earn ~ grad_rate + sat_avg', data = df).fit()
f2.summary()
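# fittedvalues and resid are index-aligned Series, so any rows dropped
# by the model for missing data simply stay NaN in df
df['fitted'] = f2.fittedvalues
df['resid'] = f2.resid
import cufflinks as cf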
cf.go_offline()
df.iplot(x = 'fitted', y = 'resid', kind = 'scatter', mode = 'markers',
         text = 'instnm', zerolinecolor = 'red', color = 'blue',
         xTitle = 'Fitted values', yTitle = 'Residuals')
Visual introduction to machine learning: http://www.r2d3.us/visual-intro-to-machine-learning-part-1/
scikit-learn
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.neighbors import NearestNeighbors
df = pd.read_csv('http://personal.tcu.edu/kylewalker/data/dec8.csv', index_col = 'name')
df.head()
np.random.seed(1983)
km = KMeans(n_clusters = 7).fit(df)
df['clusters'] = km.labels_
# Check TCU's cluster
df.loc['Texas Christian University']
from ipywidgets import interact
def glimpse_clusters(cluster_id):
    # Show the first 20 institutions assigned to the chosen cluster
    sub = df[df.clusters == cluster_id]
    print(sub.head(20))
interact(glimpse_clusters, cluster_id = (0, 6))
neigh = NearestNeighbors(n_neighbors = 5)
# "Training" the model
neigh.fit(df)
# Searching for neighbors
model = neigh.kneighbors(df, return_distance = False)
results = pd.DataFrame(model, columns = ['x1', 'x2', 'x3', 'x4', 'x5'])
merged = pd.merge(df.reset_index(), results, right_index = True, left_index = True)
def find_neighbors(university):
    d = merged[merged.name == university].reset_index()
    # x1 is skipped because it is normally the institution itself
    for x in ['x2', 'x3', 'x4', 'x5']:
        idx = d.iloc[0][x]
        m = merged.iloc[idx]
        print(m['name'])
interact(find_neighbors, university = 'Texas Christian University')
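To see why the lookup above starts at x2 rather than x1, it may help to compute nearest neighbors by hand. The sketch below uses hypothetical toy points and plain NumPy, and assumes Euclidean distance (the scikit-learn default for NearestNeighbors). It is an illustration of the idea, not what scikit-learn does internally.

```python
import numpy as np

# Hypothetical toy data: four points in two dimensions
points = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.5]])

# Pairwise Euclidean distances via broadcasting:
# dist[i, j] is the distance from point i to point j
diffs = points[:, None, :] - points[None, :, :]
dist = np.sqrt((diffs ** 2).sum(axis=2))

# Sort each row by distance and keep the indices of the two closest points
nearest = np.argsort(dist, axis=1)[:, :2]
print(nearest)
```

Each point is at distance zero from itself, so the first column of `nearest` is the row's own index and the second column is its closest other point. When the query data is the same as the training data and there are no exact duplicates, the first neighbor returned is likewise the point itself.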